Ontology Suitability for Uncertain Extraction of Information from Multi-Record Web Documents

Authors

  • David W. Embley
  • Norbert Fuhr
  • Claus-Peter Klas
  • Thomas Roelleke
Abstract

Ontology-based data extraction from multi-record Web documents works well [ECLS98, ECJ98, ECJ99, EJN99], but only if the ontology is suitable for the Web document. How do we know whether the ontology is suitable? To answer this question, we present an approach based on three heuristics: density, schema, and grouping. We encode the first heuristic as a density function and use probabilistic models for the second and third. We argue that these heuristics, together with our computational models for them, correctly determine the suitability of a Web document for a given ontology.
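
The abstract names a density function but does not define it here. As a minimal sketch only, assuming density is the share of a document's characters matched by the ontology's recognizers, the idea could look like the following. The function name `density` and the use of regular expressions as recognizers are illustrative assumptions, not the authors' implementation.

```python
import re


def density(document: str, patterns: list[str]) -> float:
    """Return the fraction of characters covered by at least one pattern match.

    Assumption: `patterns` are regular expressions standing in for the
    recognizers an extraction ontology would supply.
    """
    if not document:
        return 0.0
    covered = [False] * len(document)
    for pattern in patterns:
        for match in re.finditer(pattern, document):
            # Mark every character in the matched span; overlapping
            # matches are counted only once.
            for i in range(match.start(), match.end()):
                covered[i] = True
    return sum(covered) / len(document)


if __name__ == "__main__":
    # Hypothetical recognizers for a car-ad ontology: a year and a price.
    recognizers = [r"\b(19|20)\d{2}\b", r"\$\d+"]
    print(density("1998 Honda Civic, $4500", recognizers))
```

Under this assumption, a higher value suggests the ontology recognizes more of the page, which is only one of the three signals the abstract combines.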

Related resources

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

OLERA: On-Line Extraction Rule Analysis for Semi-structured Documents

The vast amount of online information available has led to renewed interest in information extraction (IE) systems that analyze input documents to produce a structured representation of selected information from the documents. Information extraction from semi-structured documents has been studied extensively in recent years. Most research focuses on supervised learning approaches where targets must be ...

A new text-mining method for extracting user context information to improve the ranking of search engine results

Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the most similar ones to the user ...

AUTOMATING THE EXTRACTION OF DOMAIN-SPECIFIC INFORMATION FROM THE WEB—A CASE STUDY FOR THE GENEALOGICAL DOMAIN

Troy Walker, Department of Computer Science, Master of Science. Current ways of finding genealogical information within the millions of pages on the Web are inadequate. In an effort to help genealogical researchers find desired information more quickly, we have developed GeneTIQS, a Genea...

Query Architecture Expansion in Web Using Fuzzy Multi Domain Ontology

As the Web grows, there are many challenges in establishing a general framework for data mining and retrieving structured data from the Web. Creating an ontology is a step towards solving this problem. The ontology captures the main entities and concepts of the data being mined. In this paper, we tried to propose a method for applying the "meaning" of the search system, but the problem ...

Journal:
  • Datenbank Rundbrief

Volume 24

Publication date: 1999